How Fair Use Applies in the New AI Era
The legal case for why training AI models is permissible under U.S. copyright law.
Driving the news: As debates over artificial intelligence intensify, critics are questioning whether training large language models (LLMs) on publicly available data violates U.S. copyright law. But legal precedent — and a closer look at how these models actually work — tells a different story.
Why it matters: Misunderstanding how AI models learn has led to pressure to restrict access to training data. That could seriously hinder innovation across sectors like health care, national security, and scientific research.
How it works: Generative AI models are not built to store or retrieve copies of their training data. They analyze large datasets to learn patterns in language, which lets them generate new content based on statistical relationships between words and ideas.
- Tokenization: The first step in training a large language model is breaking down text into small units called “tokens” — typically words, subwords, or characters. This allows the model to process and understand language at a granular level.
- Embeddings: Each token is then converted into a list of numbers (a vector), known as an embedding, which captures the token’s meaning and statistical relationship to other tokens in the dataset.
- Pattern learning: During training, the model adjusts these embeddings and its other internal parameters to predict which token is likely to come next. Rather than looking up stored passages, it generates new content from these learned relationships.
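The three steps above can be sketched in a few lines of Python. This is a toy illustration, not how any production model is built: the whitespace tokenizer, the random starting vectors, and the next-word counts are all simplifying assumptions, and every function name here is hypothetical. Real systems typically use subword tokenizers and learn their embeddings and predictions with neural networks.

```python
import random
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (assumption: real models use subword tokenizers)."""
    return text.lower().split()


def build_vocab(tokens):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def init_embeddings(vocab, dim=4, seed=0):
    """Give each token a small random vector; training would adjust these numbers."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in vocab}


def bigram_counts(tokens):
    """Count which token follows which: a crude stand-in for learned patterns."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never followed."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


corpus = "the cat sat on the mat and the cat slept"
tokens = tokenize(corpus)             # step 1: tokenization
vocab = build_vocab(tokens)
embeddings = init_embeddings(vocab)   # step 2: one vector per token
counts = bigram_counts(tokens)        # step 3: statistics about what follows what

print(most_likely_next(counts, "the"))
```

What the sketch keeps after processing the text is a vocabulary, a set of number vectors, and tallies of which word tends to follow which. On a corpus this small those tallies are close to the text itself; the point is only to show the kind of information each step produces.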
What the law says: U.S. copyright law protects creative expression, not facts, ideas, or language structures — the very elements AI models use. The fair use doctrine allows transformative uses of copyrighted works, especially when the new use serves a different purpose and doesn’t compete with the original.
- Established precedent: Courts have consistently upheld fair use in cases involving transformative technologies — including internet search engines and plagiarism detection tools — that rely on large-scale data processing for new purposes.
- Recent rulings: In two recent cases, courts ruled that training AI models on copyrighted material can qualify as fair use, reinforcing that this kind of learning is both legal and distinct from direct copying.
- Competition isn’t infringement: U.S. copyright law does not protect against new forms of competition. It only prohibits the copying of original expression — not the development of innovative tools that serve a different function.
The big picture: Restricting AI model training based on a misreading of copyright law risks more than legal confusion — it threatens U.S. competitiveness in a fast-moving global race.
- Breakthroughs depend on data: Innovation across sectors — from health care to cybersecurity — relies on access to large, diverse, high-quality datasets that fuel AI development.
- Restricting access slows progress: Limiting data would significantly slow advancement in critical industries, hindering everything from scientific research to real-time decision-making tools.
- Law supports technological advances: U.S. copyright law was created to encourage creativity, learning, and innovation — not to block the development of transformative technologies like AI.
Between the lines: The legal tools to protect creators and support innovation already exist. The challenge now is ensuring policy keeps pace with the new era of AI.